Persistent registry; lifecycle management #351
Treat removed workspace entries without a stored source as coming from MAIN_REPOSITORY_SOURCE when enforcing ensure_secure_source(). This closes a takeover gap for imported tombstones that lacked source data. Keep the denial message honest by showing the persisted workspace value in diagnostics. When source is missing, report "<not-set>" instead of a synthesized trusted source. Add a deny-rules test that covers removed entries without source and asserts both denial behavior and the new message wording.
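A minimal sketch of the rule described above, not the actual ensure_secure_source() implementation. The function name `check_source_move`, the placeholder value of `MAIN_REPOSITORY_SOURCE`, the message wording, and the decision to allow entries that are neither removed nor have a source are all assumptions made for illustration.

```python
from __future__ import annotations

# Placeholder value; the real constant lives in the project and is not shown here.
MAIN_REPOSITORY_SOURCE = "https://example.invalid/main-repository.json"


def check_source_move(existing: dict, incoming_source: str) -> str | None:
    """Return a denial message, or None if the move is allowed (hypothetical helper)."""
    stored = existing.get("source")
    effective = stored
    # Removed entries without a stored source are treated as coming from
    # the main repository for the purpose of the security check.
    if stored is None and "removed" in existing:
        effective = MAIN_REPOSITORY_SOURCE
    # Assumption: an entry with no source and no tombstone is not restricted.
    if effective is None or effective == incoming_source:
        return None
    # Diagnostics show the persisted value, never the synthesized one.
    shown = stored if stored is not None else "<not-set>"
    return f"denied: workspace source is {shown}, requested source is {incoming_source}"
```

The point of keeping `effective` and `shown` separate is that the check uses the trusted default while the message stays honest about what is actually stored.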
When crawl_package() raises, keep existing workspace state but ensure source is present by defaulting from the registry package contract. This keeps security behavior stable for denied source moves while also repairing entries that never had a successful crawl and thus missed source entirely. Add a regression test that verifies failed crawls adopt source from the registry entry when the existing workspace entry has no source.
Exclude registry entries with a removed field from the scheduler in both normal and presto modes. Also block explicit --name crawls for tombstoned packages with a clear message so manual runs follow the same tombstone rule. Add focused scheduler tests that verify removed entries are skipped and that the next-run hint ignores tombstoned packages.
Keep --name handling simple and explicit: if the selected registry package is tombstoned, print a clear message and return without crawl. Add a focused regression test for main_() that verifies tombstoned packages are rejected in name mode and workspace remains unchanged.
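The tombstone rule for the scheduler and for --name runs could be sketched roughly as follows. This assumes registry packages are dicts with a "name" key and an optional "removed" key; `select_for_crawl` and the message texts are hypothetical and do not mirror the real main_() code.

```python
from __future__ import annotations


def select_for_crawl(packages: list[dict], name: str | None = None) -> tuple[list[dict], str | None]:
    """Return (entries to crawl, message). Hypothetical helper for illustration."""
    if name is not None:
        entry = next((p for p in packages if p["name"] == name), None)
        if entry is None:
            return [], f"Package {name!r} not found in registry"
        if "removed" in entry:
            # Manual runs follow the same rule as the scheduler.
            return [], f"Package {name!r} is tombstoned; not crawling"
        return [entry], None
    # Scheduler path: tombstoned entries are never candidates.
    return [p for p in packages if "removed" not in p], None
```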
Teach explain_main() to treat tombstoned registry entries explicitly. For tombstones, print a clear status line to stderr. In normal mode, print the raw entry as pretty JSON. In EFFECTIVE mode, print only the status line and no JSON payload since there is no effective release view. Keep this path simple by inlining the tombstone JSON print instead of adding a helper wrapper.
Move the effective explain logic and its helper functions out of crawl.py and into _explain_package.py so the explain-specific code lives together in one place. As part of that extraction, move the shared sublime_text selector parsing helpers into _utils.py so both crawl runtime logic and explain logic use the same implementation. Update the explain tests to import the helper-facing functions from _explain_package.
Teach maintenance() to copy tombstoned registry entries into workspace.packages before the legacy orphan-marking step. This keeps removed packages present in workspace and intentionally overwrites stale crawl-only fields with the canonical tombstone data. Add focused maintenance tests for tombstone import, overwrite behavior, and continued orphan removed marking.
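As a rough illustration of the import step, assuming workspace packages are stored as a name-to-entry dict and registry packages as a list of dicts (both assumptions), the overwrite could look like this. `import_tombstones` is a hypothetical name, not the function in the codebase.

```python
def import_tombstones(registry_packages: list[dict], workspace_packages: dict[str, dict]) -> None:
    """Copy tombstoned registry entries into the workspace, replacing stale entries."""
    for entry in registry_packages:
        if "removed" in entry:
            # Replace the whole entry on purpose: crawl-only fields from an
            # earlier life of the package must not survive next to the tombstone.
            workspace_packages[entry["name"]] = dict(entry)
```

Replacing rather than merging is what makes the tombstone canonical; a merge would leave leftover crawl fields in place.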
Add a regression test that imports a tombstoned package via maintenance(), then runs main_() with an active registry entry for the same name. Verify resurrection works without special-case code: the package is crawled, removed is cleared, source remains stable, and first_seen is preserved.
Add describe_registry_changes.py to generate commit-message text from old/new registry snapshots. Implement change classification for both packages and libraries, including single-change messages, metadata bulk edits, and mixed bulk edits with additions, tombstones, and resurrections. Keep repositories out of primary classification, but fall back to "Update registry.json" when repositories change without any entity change. This keeps "Same." strict so it only appears when no commit is needed. Add focused tests for all supported classifications and fallback cases, using loader mocking for CLI tests.
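The classification idea can be sketched as below, for packages only. Only the "Same." and "Update registry.json" strings come from the description above; the snapshot layout (a dict with "packages" and "repositories"), the category names, and the summary format are assumptions, and library handling is omitted.

```python
def classify(old: dict, new: dict) -> dict[str, list[str]]:
    """Group package names by the kind of change between two snapshots."""
    old_by = {p["name"]: p for p in old.get("packages", [])}
    new_by = {p["name"]: p for p in new.get("packages", [])}
    changes: dict[str, list[str]] = {
        "added": sorted(new_by.keys() - old_by.keys()),
        "dropped": sorted(old_by.keys() - new_by.keys()),
        "tombstoned": [],
        "resurrected": [],
        "changed": [],
    }
    for name in sorted(old_by.keys() & new_by.keys()):
        was_removed = "removed" in old_by[name]
        is_removed = "removed" in new_by[name]
        if is_removed and not was_removed:
            changes["tombstoned"].append(name)
        elif was_removed and not is_removed:
            changes["resurrected"].append(name)
        elif old_by[name] != new_by[name]:
            changes["changed"].append(name)
    return changes


def describe(old: dict, new: dict) -> str:
    changes = classify(old, new)
    if any(changes.values()):
        # Hypothetical summary format, for illustration only.
        return "; ".join(f"{kind}: {', '.join(names)}" for kind, names in changes.items() if names)
    if old.get("repositories") != new.get("repositories"):
        return "Update registry.json"  # repositories changed, no entity changed
    return "Same."  # strictly nothing to commit
```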
Implement implicit seed loading in generate_registry based on --output, with explicit overrides via --seed and opt-out via --no-seed. In seeded mode, preserve package first_seen, synthesize tombstones for missing packages, preserve tombstone removed timestamps, and keep resurrection first_seen. Libraries remain non-tombstoned. Keep fetching_source_failed behavior intact and add focused registry tests that cover seed/no-seed behavior, tombstones, resurrection, library handling, and deterministic package ordering.
Simplify seeded lifecycle handling after initial implementation. Use pick() for seed extraction, inline package sorting by name, and remove a no-op removed-field cleanup path. Also ensure first_seen is populated when missing, including for tombstoned entries, while still preserving seeded first_seen when available.
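A simplified sketch of the seeded lifecycle rules, assuming both the fresh registry packages and the seed are lists of dicts keyed by "name", and that timestamps are plain strings. `apply_lifecycle_sketch` is a hypothetical stand-in, and unlike the real tombstone builder it does not filter the seeded entry down to a fixed key set.

```python
def apply_lifecycle_sketch(fresh: list[dict], seed: list[dict], now: str) -> list[dict]:
    """Merge freshly generated packages with seed data (illustrative only)."""
    seed_by_name = {e["name"]: e for e in seed}
    result = []
    for pkg in fresh:
        seeded = seed_by_name.get(pkg["name"], {})
        entry = dict(pkg)
        # Preserve the seeded first_seen (also on resurrection); otherwise stamp it now.
        entry["first_seen"] = seeded.get("first_seen") or now
        result.append(entry)
    fresh_names = {pkg["name"] for pkg in fresh}
    for name, seeded in seed_by_name.items():
        if name in fresh_names:
            continue
        # Missing from the fresh data: keep it as a tombstone.
        tombstone = dict(seeded)
        tombstone["first_seen"] = seeded.get("first_seen") or now
        tombstone["removed"] = seeded.get("removed") or now  # keep the original removal time
        result.append(tombstone)
    # Deterministic output order.
    return sorted(result, key=lambda e: e["name"])
```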
Clarify generate_registry seed semantics in both CLI help and README. The docs now explain implicit vs explicit --seed behavior and the interaction with --no-seed plus fetching_source_failed. Add scripts.seed_from_workspace as a first-class script and document its usage inline with generate_registry. The script emits sparse output for optional fields and avoids writing null source values.
Introduce scripts.generate_seed as the new seed extraction command. It accepts exactly one input source via a required mutually exclusive flag: --workspace [PATH] or --registry [PATH]. Update README examples to use generate_seed and document both supported input modes. Remove the old seed_from_workspace script.
Add an incomplete-shape warning in generate_seed based on expected entry sizes (2 keys for active entries, 5 for tombstoned entries). The warning triggers when more than 10% of entries are incomplete. Special-case the all-incomplete scenario with a clearer message: "All packages have an incomplete shape".
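The threshold logic might look roughly like this. It assumes "incomplete" means fewer keys than expected and that entries are dicts where a "removed" key marks a tombstone; `shape_warning` and the wording of the partial warning are hypothetical, while the all-incomplete message is quoted from the description above.

```python
from __future__ import annotations

ACTIVE_KEYS = 2      # expected key count for active entries
TOMBSTONE_KEYS = 5   # expected key count for tombstoned entries


def shape_warning(entries: list[dict]) -> str | None:
    """Return a warning string, or None if the data looks complete enough."""
    if not entries:
        return None
    incomplete = sum(
        1 for e in entries
        if len(e) < (TOMBSTONE_KEYS if "removed" in e else ACTIVE_KEYS)
    )
    if incomplete == len(entries):
        return "All packages have an incomplete shape"
    if incomplete / len(entries) > 0.10:  # strictly more than 10%
        return f"{incomplete} of {len(entries)} packages have an incomplete shape"
    return None
```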
Separate lifecycle seeding from source-failure recovery in generate_registry. Recovery of failed repositories now requires registry-shaped data and no longer reconstructs entries from workspace/seed maps. If an explicit seed is not registry-shaped, the command falls back to prior --output when available for recovery data. On fetch failures, emit a focused warning when the seed knows package names but no full recovery entries exist, with message text that reflects --no-seed behavior. Add regression tests for non-registry seed input and fallback-to-output recovery behavior.
Use has_registry_shape directly in resolve_failure_recovery_db without additionally checking available. The shape flag already encodes successful load plus registry-compatible structure. This keeps behavior unchanged while reducing redundant conditions.
Replace the SeedLoad available/null-object pattern with SeedDb | None. This removes ambiguous truthiness handling and makes seed presence explicit at call sites. Also rename read_seed_db(explicit=...) to strict=... to better reflect its behavior: strict mode raises on read/parse errors, while implicit mode returns None.
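The strict versus implicit behavior can be illustrated with a small loader. This is only a sketch of the pattern: `read_seed` is a hypothetical name, the set of caught exceptions is an assumption, and the real read_seed_db may validate or normalize the data further.

```python
from __future__ import annotations

import json
from typing import Any


def read_seed(path: str, *, strict: bool) -> dict[str, Any] | None:
    """Load a seed file; strict mode raises on errors, implicit mode returns None."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError):
        if strict:
            raise  # explicit seed: a broken file is the user's problem to fix
        return None  # implicit seed: quietly continue without one
```

Call sites then check `if seed is None` instead of relying on the truthiness of a wrapper object.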
Simplify lifecycle seed handling by building the name->entry map directly inside apply_seed_lifecycle and removing extract_seed_packages. Also generalize build_tombstone to accept Mapping[str, Any], keeping the key filtering in one place.
Replace generic Mapping typing for recovery_db with a dedicated RecoveryDb shape. This matches runtime guarantees and simplifies recovery iteration code. Use a TypeGuard for registry-shape detection and return RecoveryDb | None from resolve_failure_recovery_db.
Replace iter_db_entries(db, kind) with iter_package_entries(db), since this helper is only used for package traversal. This removes an unused selector parameter and makes call sites more explicit.
Avoid false-positive recovery warnings when a full registry-shaped seed is present but a failed repository has no known entries. Emit a clear generic warning only for compact seed.json inputs where failed-source recovery cannot be guaranteed, and point users to a full registry.json seed for complete recovery. Update and extend registry tests to cover compact-seed warning behavior and registry-seed non-warning behavior.
Update crawl workflow to seed generate_registry from ./.the-registry/registry.json and sync branch state via an extracted shell script. Add .github/workflows/sync_registry_branch.sh to perform compare-first syncing, fallback commit messaging, and push to the-registry. Add pytest coverage for happy path, no-op behavior, and classifier crash fallback using a local bare origin to avoid network pushes.
Not sure what it looks like in the data, but on the site removals are simply not handled (i.e. nothing gets removed or even marked as such). Maybe Will sometimes did something manually, but not in recent years. For instance, I removed this one ages ago: https://packagecontrol.io/packages/Theme%20-%20Sea%20Lion.
Oh, they just look like pages that don't get downloads. How would I scrape them? It is basically just a page with an outdated LAST SEEN tag, so you just go through all the pages and look for that. Certainly possible.
Actually, this is typically meant to be installed on the package_control_channel repo, but we don't have enough permissions to run it over there and it would just be a hassle.
Maybe one day though! 🤞🏻
Let's try this one.
Created an orphan branch, the-registry, and changed crawl.yml to read the registry from there and to push it back to that branch if it changed.
generate_registry learned lifecycle management. New packages are tagged with "first_seen"; removed packages are re-added as tombstones.
A new tool, generate_seed.py, extracts "first_seen"/"removed" data from workspace.json files or registries.
Old data has been scraped from packagecontrol.io, so we have more tombstones in the database than before. It is unlikely that we have all of them; I don't know what the actual policy was for pc.io.